Abstract

Application of Cason and Cason's (1984) simplified model of their deterministic rating
theory to ratings of abstracts submitted to Sigma Theta Tau for the International Research
Congress, Edinburgh, Scotland, resulted in good fit (R > .89) of the model to the data and
significant inter-reviewer differences in individual reviewers' stringency. The
availability of calibrated ratings, i.e., those from which these reviewer
differences had been removed, greatly eased the task of the Abstract Selection Committee:
they no longer needed to deal with ratings which were confounded with variation in reviewer
stringency (as occurs in the observed mean ratings). Abstract selection was based on ratings
that reflected the true quality of the abstract without regard to who
reviewed it. Use by the Abstract Selection Committee of these calibrated
ratings improved both the reliability (from .579 to .810)
and validity (from .485 to .742) of the peer review process.
Carolyn L. Cason and Gerald J. Cason
University of Arkansas for Medical Sciences, 
Little Rock, AR, USA 

Alice Redland 
University of Texas, Austin, TX, USA 



Peer review serves as the basis for making many highly important decisions: funding
of proposals, publication of manuscripts, and selection of papers to be presented at
professional meetings such as this. As such it determines, in part, what knowledge will be
sought by and shared with the scientific community and by whom. What is selected and the
process by which it is selected are very important to both the body of scientific
information and the individual researcher's professional career.

An Abstract Selection Committee, Sigma Theta Tau International Honor Society,
made the decisions about which abstracts submitted to it for the International Research
Congress, Edinburgh, Scotland, would be selected for inclusion in the program. The
Committee was assisted in its task by multiple reviews, completed by volunteer reviewers, of
each abstract submitted. As most of us who have had our work reviewed by different
reviewers know, there seems to be substantial variation in the apparent standards of
reviewers. To the degree that such variation exists among Sigma Theta Tau reviewers, the
task of the Abstract Selection Committee becomes more difficult and both the reliability and
validity of the process become compromised.

Interestingly, with few exceptions, variation in reviewer standards and its impact on
the review process have been largely unevaluated, and even less attention has been given to
adjusting for differences in reviewer standards. Two recent exceptions are Marsh and Ball's
(1981) study of variation among reviewers of manuscripts submitted to the Journal of
Educational Psychology and our own work on paper proposals submitted for consideration in
the program of Division I: Professions Education, American Educational Research
Association (Cason, Cason, & Stritter, 1986a and 1986b). Marsh and Ball found no
significant reviewer effect, but this may well have been because their data contained a rather
large number of manuscripts which had been reviewed by only a single reviewer (excluding
the journal editor's review) and many (i.e., two thirds of the) reviewers who had reviewed
only one or two manuscripts. In our previous study, we found significant and important
reviewer effects in the reviews of paper proposals; effects which reduced both the reliability
and validity of the selection process. Both of these studies were retrospective, that is, they
used data on selection decisions already made: reviewer effects were not formally considered
in making actual decisions about acceptance or rejection of the manuscript/proposal.

This study of the Sigma Theta Tau abstract review and selection process had two
objectives. They were:

1. To evaluate the extent to which variation in standards/stringency exists among
reviewers.

2. To provide, for use by the Abstract Selection Committee, ratings of abstracts from which
the effects of such variation had been removed.

Our performance rating theory and its derivative simplified model served as the
framework for the study (Cason & Cason, 1984; Cason & Cason, 1986). It was briefly
[...] (Cason & Cason, 1987). Application of the theory
and simplified model has as its objective the detection, quantification, and mathematical
control of variation in reviewer/rater stringencies. Mathematical control of differences
in reviewer stringency is intended to augment the more usual methods of controlling error
associated with ratings (e.g., rater training, improved inventory reliability, all raters/reviewers
rating all subjects) and to off-set systematic rater error when such methods are
impractical. Application of the model may be likened to a calibration procedure in which
knowledge of the reviewers' stringencies is used to adjust or calibrate abstract ratings so as to
take into account the different stringencies of the reviewers who reviewed abstracts.

Data Source and Methods 

The ratings given by individual reviewers to research abstracts submitted to Sigma Theta
Tau International Honor Society for the International Research Congress in Edinburgh,
Scotland served as the data for these analyses. Reviewers indicated their ratings of each
abstract on machine scannable rating sheets specifically prepared for this use. Figure 1
illustrates this special scan sheet.



These machine scannable rating sheets were pre-printed (with a computer's line printer)
by the Department of Computer Services, University of Arkansas for Medical Sciences,
with the information contained in Figure 1: identifying information
(subject name or in this case abstract number), criteria or inventory of rating items upon
which each abstract was rated, and the scale to be used for making the ratings. Two copies of
each of 650, i.e., abstract identification numbers P001 through P650, were printed. So that
information about the performance of reviewers could be obtained, a list of 75 unique rater
identification numbers was provided by the Department of Computer Services. Preparation
of such rating sheets and identifying numbers is a routine service provided to faculty who
intend to use the UAMS Objective Test Scoring and Performance Rating (OTS-PR) system
for clinical performance rating of students enrolled in the various programs on the UAMS
campus.

The pre-printed rating sheets and the rater identification numbers were sent to the
Program Office, Sigma Theta Tau International Honor Society. As individuals agreed to
serve as voluntary reviewers, they were assigned one of the rater
identification numbers by a staff member. As an abstract was received in the Program Office
it was logged in and given an identification number by the staff of the office. All abstracts
(research and congress-related topic) were numbered sequentially as they were received. For
those abstracts identified as research abstracts, the staff members obtained the rating sheets
with the corresponding proposal identification number (subject name), selected the two
individuals who would serve as reviewers, entered the reviewer's identification number on the
appropriate rating sheet, and sent the rating sheet and abstract to each of the two reviewers.
Reviewers for each abstract were selected randomly by the Program Office staff with the only
restriction being that the reviewer not be located in the same institution or general area of
the country as the author(s) of the abstract.

Upon receipt of the rating sheet and research abstract, the reviewers recorded their
ratings of the abstract on each of six general criteria: acceptability for program, overall
quality of work, contribution to nursing scholarship, contribution to nursing theory, originality
of work, and clarity and completeness of the abstract. These are shown in Figure 1. The last
general criterion contained six specific items to be used in evaluating the abstract's clarity
and completeness: purpose, objective(s), theoretical framework, method/mode of inquiry,
findings/conclusions, and implications. Thus, reviewers were asked to make a total of 12
ratings on each abstract by filling in the numbered circle to the left of each item on the
inventory which represented their rating of the abstract under consideration. Possible
ratings were: outstanding; very good; good; poor; missing, absent, or very poor;
not applicable; and no opinion. Reviewers who wished to make comments about the
abstract could do so in the space provided to the right of the items. Finally, reviewers were
asked to sign the sheet and return it to the Program Office.

Completed rating sheets were collected by the Program Office staff and then forwarded to
us. [...] rating sheets were forwarded. Of these, only 972 contained
usable data. Rating sheets were not usable primarily for two reasons: reviewers gave no
ratings but returned the sheets with comments indicating that they were personally familiar
with the research or its author(s), or they thought that the abstract was not a research abstract
and should be considered as a congress-related topic. The 972 usable rating sheets
contained ratings of 503 abstracts (only one rating sheet was available on 26 abstracts while
two rating sheets were available on each of the other 477 abstracts). There were a total of 61
volunteer reviewers. The actual number of research abstracts reviewed by each reviewer
varied from 4 to 28. On average, each reviewer reviewed 16 abstracts (S.D. = 5). (These
reviewers also served as reviewers of abstracts on congress-related topics. Their reviews of
those abstracts are not reflected in these analyses.)

Figure 1. Pre-printed rating sheet with inventory.

Two sets of analyses were conducted on the rating data provided by the abstract reviewers.
The item-level observed ratings contained on the rating sheets were processed through the
OTS-PR system. This produced the standard set of reports on subjects
(abstracts), raters (reviewers), and the assessment procedure (i.e., rating inventory). The
second set of analyses was accomplished by specialized computer programs which provide
estimates of formal parameters of the simplified model of our rating theory; that is, computer
programs which provide estimates of reviewer stringency and abstract true quality.
These estimates are obtained by a highly specialized application of regression analysis
[...]. These specialized computer programs also provide estimates of (a)
the fit of the model with the data, (b) contribution of reviewer stringency and abstract quality
to the observed ratings (i.e., reviewer and abstract effects), and (c) calibrated ratings which
represent the rating expected when reviewer effects are removed (i.e., if all reviewers rated
all abstracts and the average for each abstract was used). In the item-level quantitative
analyses carried out by the OTS-PR programs, scale values were defined as depicted in
Figure 1, i.e., 5 = outstanding, 4 = very good, 3 = good, 2 = poor, 1 = missing, absent, or very
poor. Responses of not applicable or no opinion were dealt with as missing data. Estimation
of reviewers' stringency and abstracts' true quality was completed using a weighted total
percent score as the observed dependent measure (i.e., regression criterion variable) for each
reviewer-abstract pair in the observed data (i.e., each rating sheet). This score was obtained
by first [...] the six items associated with the general criterion,
"abstract clarity and completeness". This score and the rating assigned to each of the other
five criteria were summed, divided by the number of criteria (6) and multiplied by 100 (i.e.,
expressed as a percentage). This transformation was made to simplify and facilitate the [...].

In preparation for analyzing the item-level observed data, the OTS-PR programs were
modified so that the system could process ratings on up to 400 unique subjects (abstracts).
The actual number of abstracts reviewed was 503. Time did not permit altering these
programs again; therefore, in order to complete the initial processing of these ratings, the
rating sheets were divided into two subsets: the first subset contained ratings on 256
abstracts; the second, ratings on 247 abstracts. In general, the first subset contained those
ratings of abstracts which were forwarded to us first (i.e., about mid-January) while the
second set contained all others (i.e., those forwarded between mid-January and January 30).
Each set of data was processed separately through the OTS-PR system. In order to apply the
simplified model to the ratings of research abstracts, the data had to be divided into 3
subsets. Each of the two OTS-PR system data subsets was too large; the DEC-System 10
mainframe computer, used to complete all of the analyses, and associated memory and disk
[...] programs. The three sets were determined randomly
with the restriction that all ratings for an abstract were contained in a single data set. When
this requirement was not met, the ratings on an abstract were merged into the smaller data
set. ([...] were located in more than one data set.)

Each of the three randomly created data subsets (AYE, BEE, and CEE) was evaluated to
[...] simplified model. Using programs
[...] and LOCATE, each data set was analyzed separately for rater
effects and to obtain estimates of rater stringency and abstract quality.



Figure 2. Scores in various units of measure on each abstract.

Results



The observed data analyses completed using the OTS-PR system yielded a variety of
reports including (a) information about the inventory, (b) information about the abstracts,
and (c) information about the raters/reviewers. Reports on abstracts included: total and
category (criterion) scores for each abstract (in various units of measure), illustrated in
Figure 2; a rank order listing of abstracts including total scores and number of raters,
illustrated in Figure 3; and individual abstract performance reports which, as illustrated in
Figure 4, both graphically and numerically depict the subject's performance relative to group
performance. Reports on reviewers included: the number of subjects/abstracts rated; the
average ratings given across these subjects (total and category (criterion) scores in various
units); individual rater reports which, as illustrated in Figure 5, depict both graphically and
numerically the rater's performance relative to group performance; and a rank order listing
of the raters in terms of these ratings, illustrated in Figure 6.

In all of these reports generated by the OTS-PR system, average ratings are computed as a
simple arithmetic mean of observed scores; the computational procedure most commonly
used to obtain a summary score. These scores and this computational approach assume that
the standards of the reviewers/raters are highly similar or at least that differences in
reviewers will be "balanced out" where an average across reviewers is used; an assumption
that most individuals who have had their performance or products evaluated by different
persons have sometimes found to be unwarranted. That the standards of the reviewers who
provided ratings of research abstracts were different is suggested in at least two of the reports
which OTS-PR produces: the rater performance report (Figure 5) and the rater rank order
report (Figure 6). The rater/reviewer performance report is intended to give the rater
information about the standards he or she used relative to other raters who rated
subjects/abstracts. It was developed as a means of providing feedback to individual raters
much as the individual performance report provides feedback to the subject about his
performance. As can be seen from Figure 5, the average ratings given by this reviewer to the
proposals he/she rated departed from the average of the (average) ratings given by all
reviewers. If the average quality of the abstracts rated by each reviewer were the same (as
would be expected because of random assignment of abstracts to reviewers), then any
substantial variation in average ratings given by different reviewers would reflect differences
in standards (i.e., stringency).

The rater rank order report (Figure 6) also suggests that the standards used by the
reviewers who reviewed research abstracts differed. This report provides information only
about the relative mean observed ratings of the reviewers, i.e., those at the top of the figure
having assigned higher ratings to their abstracts than those raters at the bottom. The
presence of such differences in the mean observed ratings and implied differences in
standards/stringencies of reviewers makes the determination of the true quality of abstracts
more difficult: how much of the total or average score is a function of true abstract quality
and how much it is a reflection of who happened to review the abstract (and their
standards/stringency relative to the pool of potential reviewers) remains obscure.

As can be seen in Table 1, application of our simplified model to the data yielded quite
good fit (R > .89). There were significant differences (p < .0001) in rater standards in each
of the data sets. For details on the way in which these effects were tested, see the description
of the statistical models provided in Cason and Cason (1984). Table 1 also shows the relative
contribution of reviewer stringency, proposal quality, and random error to the variance in
observed ratings in each of the three data sets and for all data sets considered together.
Components of variance in Table 1 were estimated as a sum of the products of the respective
standardized weights (Beta_j) and correlations (r_jy) between predictor variables (one binary
vector per reviewer and one per abstract) and the criterion in the regression equation.
(Equation 1)
$$\text{Proportion of Variance} = \sum_{j} \beta_j \, r_{jy}$$

where j = 1 to n abstracts; or, 1 to k reviewers.

The summation of products is across the set of either reviewers or abstracts (Hays, 1963). 
Across all three data sets, differences in reviewer stringency accounted for very nearly as 
much variance (.404) in the observed raw ratings as did true abstract quality (.407). 

Table 1

Fit of Cason and Cason's Model to Research Abstracts
Submitted for the Edinburgh Conference

                              All          Data Subsets
                              Data      AYE      BEE      CEE

Multiple R                    .899     .901     .905     .894

Components of Variance
  Reviewers                   .404     .401     .397     .416
  Abstracts                   .407     .410     .423     .383
  Error                       .192     .189     .201     .181

Number of
  Reviewers                     61       50       48       49
  Abstracts                    503      176      162      168
  Observations                 972      344      297      331

Note: All Rs are significant at p < .00001. Rater effects were significant at p < .0001.

The origin of the stringency/ability scale was set in each analysis by assigning an
arbitrarily chosen reviewer a stringency of 500. This produced apparent differences in both
mean reviewer stringency and mean abstract quality for each data set. These differences are
shown in Table 2 and are labeled as "preliminary" means. Because abstracts were randomly
assigned to the three groups, it was reasonable to assume that mean abstract quality was
equal across groups. There being far more abstracts than reviewers, the sampling error of the
mean for abstract quality was smaller (as shown in Table 2). Therefore, mean abstract
quality was better suited as a basis for calibrating the results of the analyses on the three
subsets of data.

Table 2

Means and Standard Errors of Model Parameters
in Each of the Three Data Sets

                              AYE       BEE       CEE

Reviewer Stringencies
  Preliminary Mean            518       482       468
  Calibrated Mean             518       508       503
  Standard Error                9        12        14

Abstract Quality
  Preliminary Mean         570.14    543.89    535.30
  Calibrated Mean          570.14    570.14    570.14
  Standard Error             4.20      5.90      4.50

Calibrated values from the separate analyses were obtained by adding a constant to each of
the stringency and ability parameters obtained in data sets BEE (26.25) and CEE (34.85) such
that the mean abstract quality for each group equaled the mean abstract quality found in
group AYE. This may be understood as moving the origin (zero point) of the scale for
groups BEE and CEE 26 and 35 points respectively while not altering the distances between
the reviewer stringencies or abstract qualities within each group. The effect on the means is
shown in Table 2 labeled "calibrated". Note that there are small differences between
reviewer means, much smaller than before calibration. After calibration the
differences in mean stringencies are well within expected sampling fluctuations, as would be
expected because abstracts were randomly assigned to reviewers.
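
The origin shift described above can be illustrated with a short sketch. It uses only the rounded means printed in Table 2; the function and variable names are hypothetical and are not taken from the authors' programs. Because it starts from rounded means, it gives 34.84 for CEE where the text reports 34.85, which was presumably computed from unrounded values.

```python
# Preliminary subset means as printed in Table 2 (rounded values).
PRELIM_QUALITY = {"AYE": 570.14, "BEE": 543.89, "CEE": 535.30}
PRELIM_STRINGENCY = {"AYE": 518.0, "BEE": 482.0, "CEE": 468.0}


def calibration_constants(quality_means, reference="AYE"):
    """Constant to add in each subset so its mean abstract quality equals the reference mean."""
    ref = quality_means[reference]
    return {name: ref - mean for name, mean in quality_means.items()}


def apply_shift(values, constants):
    """Add each subset's constant to that subset's value.

    The same constant is added to every parameter in a subset, so distances
    between parameters within the subset are unchanged.
    """
    return {name: values[name] + constants[name] for name in values}


constants = calibration_constants(PRELIM_QUALITY)
calibrated_quality = apply_shift(PRELIM_QUALITY, constants)
calibrated_stringency = apply_shift(PRELIM_STRINGENCY, constants)
```

Rounded to whole points, the shifted stringency means agree with the "calibrated" row of Table 2.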

In principle the abstract quality parameter values could be directly used for the selection of
abstracts. They do represent the best estimate of each abstract's quality independent of the
standards (stringency) of the particular reviewers that rated given abstracts. However, they
are on an unfamiliar scale and not easily interpretable in terms of the definitions on the
original rating inventory. Therefore a calibrated (adjusted) rating was computed for each
abstract that was in percent units on the original rating scale. The calibrated rating was
computed as the mean of the expected ratings an abstract would have received had all the
reviewers (in all the sub-groups) rated all the abstracts. For a given abstract, its calibrated
abstract quality parameter value and the stringency parameter values of all raters in all
groups were used to obtain expected ratings; then, the mean of these was taken as the
calibrated (adjusted) rating for that abstract. Program NULOCS was used to generate the
calibrated parameters and then the calibrated ratings. Table 3 shows that this approach
achieved highly similar means and standard errors in calibrated ratings across the three sets,
as would be expected from the assumed equal mean quality of abstracts arising from random
assignment.

Table 3 

Mean, Standard Error, and Standard Deviation of Calibrated Ratings 

AYE BEE CEE 

Mean 66.40 65.50 66.00 

Standard Error 1.17 1.38 1.17 

Standard Deviation 15.50 17.51 15.20 

According to Hays (1963, p. 424), the intra-class correlation ($r_{ic}$) is a function of the
variance attributable to an effect ($\sigma_a^2$) as a proportion of total variance.

(Equation 2)
$$r_{ic} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2}$$

The proportion of variance attributable to proposal quality in Table 1 can thus be interpreted 
as the intra-class correlation of reviewers with respect to their observed ratings of abstracts. 
As Hays points out, this is equivalent to the reliability of a single reviewer's observed rating. 
Alternatively, this value may be interpreted as the expected correlation between the rating 
given by randomly chosen pairs of reviewers. The reliability of a mean of several reviewers' 
ratings, as is available in these data (where number of reviewers = k), is given by the 
Spearman-Brown expansion formula: 

(Equation 3)
$$r_k = \frac{k \, r}{1 + (k - 1) \, r}$$

where r = the reliability of a unit length measure, in this case a single reviewer; and
k = number of reviewers.
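
As a quick numerical check, the Spearman-Brown expansion in Equation 3 can be sketched as follows. The function name is hypothetical; the inputs are the single reviewer reliabilities reported in Table 4 (.407 observed, .680 calibrated).

```python
def spearman_brown(r, k):
    """Reliability of the mean of k ratings, given the reliability r of a single rating."""
    return (k * r) / (1 + (k - 1) * r)


# Single reviewer reliabilities for all data, as reported in Table 4.
observed_k2 = spearman_brown(0.407, 2)
calibrated_k2 = spearman_brown(0.680, 2)
```

To three decimals these agree with the two reviewer values in Table 4 (.579 and .810), within rounding of the inputs.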

Table 4 shows the impact of calibrating ratings on the reliability of both a single
reviewer and aggregate ratings calculated from 2 reviewers. The reliabilities for the single
reviewer calibrated ratings were obtained by including only the sum of the random error and
abstract variances in the denominator in Equation 2. The reliabilities of the observed ratings
must include the variance associated with reviewers in addition to that associated with
proposals and error (Ebel, 1951). As can be seen from Table 4, the reliability of calibrated
ratings from a single reviewer (.68) is substantially higher than that of the observed rating
from a single reviewer (.41). A small percentage of the abstracts were indeed reviewed by only a
single reviewer and for those the single reviewer reliabilities are most accurate. However, as
the vast majority of abstracts were reviewed by two reviewers, in general the two reviewer
reliability better represents the overall reliability of the review process. The two reviewer
reliability for observed ratings is so low (i.e., r < .60) that a great deal of error would arise
were observed ratings used as the basis for selection of papers for the program. While
imperfect, the two reviewer reliability for calibrated ratings (.81) indicates that these ratings
provide a good basis for selecting papers for the program.
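
The two denominators described above can be sketched directly from the "All Data" variance components in Table 1. This is an illustrative calculation with a hypothetical function name, not the authors' program. The published components are rounded proportions that sum to 1.003, so the observed value computed here (about .406) differs slightly from the tabled .407.

```python
def intraclass(abstract_var, error_var, reviewer_var=0.0):
    """Intra-class correlation: abstract variance as a proportion of the variance retained.

    Leaving reviewer_var at 0.0 corresponds to calibrated ratings, where
    reviewer effects have been removed from the denominator.
    """
    return abstract_var / (abstract_var + error_var + reviewer_var)


# "All Data" components of variance from Table 1.
REVIEWERS, ABSTRACTS, ERROR = 0.404, 0.407, 0.192

observed_single = intraclass(ABSTRACTS, ERROR, reviewer_var=REVIEWERS)
calibrated_single = intraclass(ABSTRACTS, ERROR)
```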

Table 4
Reliability of Ratings
Intra-Class Correlations

                Single Reviewer          Aggregate of Reviewers
                     k = 1                       k = 2
              Observed   Calibrated      Observed   Calibrated

All Data        .407        .680           .579        .810

Subset
  AYE           .410        .685           .582        .813
  BEE           .423        .678           .595        .808
  CEE           .383        .679           .554        .809


Although consistency among reviewers, represented as an intra-class correlation (Ebel,
1951; Stanley, 1961), is frequently interpreted as a measure of reliability, it may also be
interpreted as a measure of validity. Stanley (1961) observed that each reviewer may be
considered a different method of measuring a given construct (e.g., abstract quality).
Therefore, the single rater reliabilities (intra-class correlations) reported in Table 4 may be
equally well interpreted as both single rater reliability coefficients and single rater validity
coefficients. However, validity (Equation 4) does not expand as rapidly as does reliability
(Equation 3) with increased numbers of independent observations (Gulliksen, 1950).

(Equation 4)
$$r_{xy,k} = \frac{r_{xy} \, k^{1/2}}{\left(1 + (k - 1) \, r_{xx}\right)^{1/2}}$$

where $r_{xy,k}$ is the validity based on k independent raters;
$r_{xy}$ is the validity of a single rater;
$r_{xx}$ is the reliability of a single rater; and,
k is the number of independent reviewers/ratings.
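
The expansion of validity with k can be checked numerically. The sketch below assumes Equation 4 has the form r_xy * sqrt(k) / sqrt(1 + (k - 1) * r_xx); that form is an assumption, chosen because it reproduces the two reviewer validities reported for all data (.485 observed, .742 calibrated). The function name is hypothetical.

```python
import math


def validity_of_mean(r_xy, r_xx, k):
    """Validity of the mean of k independent ratings.

    r_xy: validity of a single rater; r_xx: reliability of a single rater.
    """
    return (r_xy * math.sqrt(k)) / math.sqrt(1 + (k - 1) * r_xx)


# In this special case single rater validity equals single rater reliability.
observed_k2 = validity_of_mean(0.407, 0.407, 2)
calibrated_k2 = validity_of_mean(0.680, 0.680, 2)
```

With r_xy equal to r_xx, the result is the square root of the Spearman-Brown reliability times the single rater value, which is why validity grows more slowly than reliability as k increases.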

Table 5 reports the validity of ratings from a single reviewer and the aggregate of ratings
from two reviewers as measures of abstract quality. As discussed above, the validities
associated with a single reviewer's observed and calibrated ratings are in this special case
equal to the corresponding reliabilities associated with a single reviewer's observed and
calibrated ratings reported in Table 4. As with reliability, a non-trivial improvement in
convergent construct validity was obtained by calibrated ratings when contrasted with
observed ratings.

Given these results, the calibrated ratings offer a more reliable and valid basis for making
decisions about disposition of the research abstracts. The work of the Abstract Selection
Committee was facilitated by sorting these calibrated ratings into descending rank order and
printing them along with their abstract identification numbers and mean observed ratings.
This list as well as Tables 1 and 4 and the reports from OTS-PR were forwarded to members
of the Abstract Selection Committee. All decisions about abstracts to be included in the
program were made by the Abstract Selection Committee.

Table 5
Validity of Ratings

                Single Reviewer          Aggregate of Reviewers
                     k = 1                       k = 2
              Observed   Calibrated      Observed   Calibrated

All Data        .407        .680           .485        .742

Subset
  AYE           .410        .685           .488        .746
  BEE           .423        .678           .501        .740
  CEE           .383        .679           .461        .741

When the 503 research abstracts were examined by the Abstract Selection Committee,
they found that of these only 500 were unique (3 abstracts had been given two identification
numbers and were reviewed by two sets of two reviewers) and that another nine should be
considered as congress-related topics. Thus, there were 491 research abstracts considered
for inclusion in the program. Of these, the Abstract Selection Committee selected 302 for
inclusion in the program (94 as paper presentations and 208 as poster presentations).

One way of depicting the impact of using calibrated rather than mean observed ratings in 
making the program selection decision is shown in Table 6. Using the simplified decision 
rule of accepting the top 302 rated research abstracts regardless of any other considerations 
would have resulted in 64 being accepted under one measure and rejected under the other. If 
abstract selection had been based only on the judged quality of an abstract (the task 
completed by each reviewer on each abstract which they reviewed), then use of the calibrated 
ratings rather than the mean observed ratings could have produced as great as a 21% 
difference in the specific abstracts selected for the program. 
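
The simplified decision rule can be sketched as a cross-tabulation of the two top-n selections. The ratings below are hypothetical toy values used only to show the mechanics, and the function names are illustrative. The final line uses the counts reported in the paper (64 transitions among 302 selections).

```python
def top_n(scores, n):
    """Ids of the n highest scores; ties are broken by id so the result is deterministic."""
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return set(ranked[:n])


def transition_table(observed, calibrated, n_select):
    """Counts keyed by (outcome under observed rating, outcome under calibrated rating)."""
    sel_obs = top_n(observed, n_select)
    sel_cal = top_n(calibrated, n_select)
    everyone = set(observed)
    return {
        ("select", "select"): len(sel_obs & sel_cal),
        ("select", "reject"): len(sel_obs - sel_cal),
        ("reject", "select"): len(sel_cal - sel_obs),
        ("reject", "reject"): len(everyone - sel_obs - sel_cal),
    }


# Hypothetical ratings for four abstracts (not data from the study).
observed = {"P001": 80.0, "P002": 70.0, "P003": 65.0, "P004": 50.0}
calibrated = {"P001": 78.0, "P002": 60.0, "P003": 72.0, "P004": 55.0}
table = transition_table(observed, calibrated, n_select=2)

# Share of the 302 selections that change between measures, from the reported counts.
percent_changed = 100 * 64 / 302
```

Because both rules select the same number of abstracts, the two off-diagonal counts are always equal, as they are in the reported results (64 and 64).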

Table 6

Transitions in Selection Outcome Resulting from
Using Calibrated or Mean Observed Ratings

                         Outcome Based on Calibrated Rating
Outcome Based on
Observed Rating          Select      Reject      Total

Select                     238          64         302
Reject                      64         125         189
Total                      302         189         491

However, reviewers provide only one level of evaluative information. They are asked to
judge only the abstract under consideration in terms of its quality. They make these
judgements independent of such other considerations as comprehensiveness or
representativeness of the final program. These other considerations in abstract selection are
evident when one examines the correlations between disposition and mean observed ratings
(r = .47; N = 491) and calibrated ratings (r = .69; N = 491). The Abstract Selection
Committee used a combination of decision rules in making decisions about the inclusion or
exclusion of abstracts in the program including (a) selecting the top rated abstracts, (b)
selecting only a single abstract from an author with multiple submissions, (c) selecting

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Table 7 

Research Abstracts Selected or Not for Inclusion in the Program 
Means, Standard Deviations, and Ranges of Observed and Calibrated Ratings 

                              Observed Ratings    Calibrated Ratings

Selected for Inclusion
N = 302
  Mean                        72.9                74.9
  Standard Deviation          11.9                10.0
  Minimum-Maximum             44.1-98.3           54.0-99.7

Not Selected for Inclusion
N = 189
  Mean                        59.3                52.9
  Standard Deviation          13.4                12.9
  Minimum-Maximum             28.3-95.0           7.1-84.9

abstracts so as to achieve representativeness of participating countries. The use of such
decision rules in making decisions about the program is also reflected in the means and
standard deviations shown in Table 7. Those abstracts selected for inclusion in the program
had mean observed and mean calibrated ratings well above those of the abstracts not
included. There was, however, overlap in the ranges of ratings of abstracts accepted and not
accepted. The calibrated ratings of abstracts selected for inclusion ranged from 54.0 to 99.7,
while for those not selected the range was 7.1 to 84.9. However, of those not selected for
inclusion, all with calibrated ratings greater than 61.8 were authored by individuals who had
multiple submissions and had had one of their other research abstracts selected for inclusion.
Excluding the ratings of other abstracts of authors having one abstract selected, there was
only a small overlap between the maximum calibrated rating of an abstract that was rejected
(61.8) and the minimum of one selected (54.0). This small overlap results from the other
factors, e.g., balance relative to countries, that were considered.

Discussion and Conclusions 

Application of our simplified model to these data revealed that about half (.40) of the
variance accounted for was attributable to differences among reviewers and that
those differences in reviewer stringency were statistically significant. These reviewer effects
were stronger than those observed in previous research (Cason, Cason, & Stritter, 1986a) 
where reviewer effects accounted for only .117, .189, and .144 of the variance. The 
proportion of the variance attributable to abstract quality in this study (.407) was highly 
similar to that found in the earlier study (.459, .393, and .415). 

Table 8

Single Reviewer Reliability and Validity
for Mean Observed and Calibrated Ratings

                           Reliability & Validity
                           Observed      Calibrated
Study                      Rating        Rating

Sigma Theta Tau            .407          .680
Cason et al (1986)
  AERA 1983                .459          .520
  AERA 1985                .393          .485
  AERA 1986                .415          .485
Marsh & Ball (1981)        .340          .350
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Calibration of ratings, i.e., removing the large and significant reviewer effects, yielded 
ratings of research abstracts more reflective of their true quality. These results are also
consistent with those reported previously. As shown in Table 8, the single reviewer 
reliabilities and validities for observed ratings are about the same as those found by Marsh 
and Ball (1981) and Cason, Cason, and Stritter (1986a). Calibration of ratings improved the 
single reviewer reliabilities and validities in all cases. In the Sigma Theta Tau data, the
improvement in reliability and validity was noticeably greater than in either of the other data
sets. This was a direct result of a larger component of variance being associated with 
systematic rater bias, i.e., reviewer stringencies, in the Sigma Theta Tau data. 
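To illustrate what removing reviewer stringency means in practice, the sketch below fits a plain additive model (score = abstract quality + reviewer offset) by alternating averages. This is an assumption made for illustration only: it is not the simplified deterministic rating model applied in this study, and the function name `calibrate`, the reviewer labels, and the six toy scores are all hypothetical.

```python
from collections import defaultdict


def calibrate(ratings, n_iter=100):
    """Fit score = quality[abstract] + offset[reviewer] by alternating averages.

    ratings: list of (reviewer, abstract, score) tuples.
    Returns (quality, offset) dicts; offsets are centered to average zero.
    Illustrative additive model only, not the model used in the paper.
    """
    offset = defaultdict(float)
    quality = {}
    for _ in range(n_iter):
        # Abstract quality: mean of its scores after removing reviewer offsets.
        by_abstract = defaultdict(list)
        for reviewer, abstract, score in ratings:
            by_abstract[abstract].append(score - offset[reviewer])
        quality = {a: sum(v) / len(v) for a, v in by_abstract.items()}

        # Reviewer offset: mean residual of that reviewer's scores.
        by_reviewer = defaultdict(list)
        for reviewer, abstract, score in ratings:
            by_reviewer[reviewer].append(score - quality[abstract])
        raw = {r: sum(v) / len(v) for r, v in by_reviewer.items()}

        # Center the offsets so the model is identifiable.
        shift = sum(raw.values()) / len(raw)
        offset = defaultdict(float, {r: x - shift for r, x in raw.items()})
    return quality, dict(offset)


# Hypothetical toy data: R1 is lenient (+5), R2 is stringent (-5), R3 is neutral.
# Each abstract is read by two of the three reviewers.
toy = [
    ("R1", "A", 75), ("R2", "A", 65),
    ("R2", "B", 55), ("R3", "B", 60),
    ("R1", "C", 85), ("R3", "C", 80),
]
quality, offset = calibrate(toy)
for abstract in sorted(quality):
    scores = [s for _, a, s in toy if a == abstract]
    print(abstract, "observed mean:", sum(scores) / len(scores),
          "calibrated:", round(quality[abstract], 2))
```

In the toy data, abstract B is read by the stringent reviewer and abstract C by the lenient one, so their observed means (57.5 and 82.5) are pulled away from the calibrated values (60 and 80). This is the kind of confounding by reviewer assignment that calibration is meant to remove.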

In each of these studies abstracts/proposals/manuscripts were reviewed by more than a
single reviewer. Since multiple reviewers were used in each case, the reliabilities and
validities of the peer review process are those reported for the aggregate of reviewers. These
reliabilities and validities are shown in Table 9. The mean observed
ratings from the Sigma Theta Tau data had the lowest reliability and the calibrated ratings
the second highest of the three studies. Thus application of the model to the Sigma Theta
Tau data obtained the largest improvement in reliability of the overall process even though
only two reviewers were used as compared with four reviewers in the Cason et al. study. The
pattern of results for validity is similar. The validity of the observed mean ratings for the
Sigma Theta Tau data was lowest. However, calibration of Sigma Theta Tau ratings yielded
much greater improvements than those found in the other studies even though only two
reviewers reviewed each abstract. Had mean observed ratings been used for selection, the
Sigma Theta Tau review process would have been the weakest (i.e., least reliable and valid)
of these studies. Using the calibrated ratings probably resulted in the Sigma Theta Tau
review process being the most valid among these studies.

Table 9

Reliability and Validity of the Peer Review Process

                       Number        Reliability               Validity
                       of            Observed   Calibrated     Observed   Calibrated
Study                  Reviewers     Rating     Rating         Rating     Rating

Sigma Theta Tau        2             .579       .810           .485       .742
Cason et al (1986)
  AERA 1983            4             .768       .813           .595       .650
  AERA 1985            4             .722       .790           .532       .619
  AERA 1986            4             .739       .790           .554       .619
Marsh & Ball (1981)    3             .670       .683           .509       .522

'includes journal editor. 
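The text does not say here how the aggregate reliabilities were obtained from the single reviewer values. One standard relationship is the Spearman-Brown formula for the mean of k parallel ratings. Assuming that formula applies, and reading the Sigma Theta Tau calibrated entry in Table 8 as .680, the sketch below reproduces the Sigma Theta Tau reliabilities in Table 9 (.579 and .810). The function name is illustrative.

```python
def spearman_brown(r_single, k):
    """Reliability of the mean of k parallel ratings, given single-rating reliability."""
    return k * r_single / (1 + (k - 1) * r_single)


# Sigma Theta Tau single reviewer reliabilities (Table 8), two reviewers per abstract.
observed_k2 = spearman_brown(0.407, 2)
calibrated_k2 = spearman_brown(0.680, 2)

print(f"observed:   {observed_k2:.3f}")
print(f"calibrated: {calibrated_k2:.3f}")
```

The same formula shows why a calibrated two-reviewer process can match an uncalibrated four-reviewer one: raising single reviewer reliability has a larger effect than adding reviewers when the starting reliability is low.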

The availability of calibrated ratings to the Committee greatly eased the Committee's 
task: they no longer had to deal with ratings which were confounded by variation in reviewer 
stringency (as occurs in the mean observed ratings). Abstract selection could be based on
ratings that more accurately reflected the quality of the abstract without regard to who 
happened to have reviewed it. Use by the Abstract Selection Committee of these calibrated 
ratings in making selection decisions greatly enhanced both the reliability and validity of the 
peer review process. 
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